It is ok to practice hard regarding the commands we learned the last time, however I did not emphasize much.
You might wonder why don’t we manipulate the data set and picture it instead of learning microscope details. For example, calculate relative frequency by group variables and just plot them.
I agree. But in this case, you need to learn how to manipulate data.
Let’s interpret the following codes:
setwd("~/Documents/ibs_course/BUS240/data")
load('gss_sm.rda')
head(gss_sm, 10)
## # A tibble: 10 × 32
## year id ballot age childs sibs degree race sex region incom…¹ relig
## <dbl> <dbl> <labe> <dbl> <dbl> <lab> <fct> <fct> <fct> <fct> <fct> <fct>
## 1 2016 1 1 47 3 2 Bache… White Male New E… $17000… None
## 2 2016 2 2 61 0 3 High … White Male New E… $50000… None
## 3 2016 3 3 72 2 3 Bache… White Male New E… $75000… Cath…
## 4 2016 4 1 43 4 3 High … White Fema… New E… $17000… Cath…
## 5 2016 5 3 55 2 2 Gradu… White Fema… New E… $17000… None
## 6 2016 6 2 53 2 2 Junio… White Fema… New E… $60000… None
## 7 2016 7 1 50 2 2 High … White Male New E… $17000… None
## 8 2016 8 3 23 3 6 High … Other Fema… Middl… $30000… Cath…
## 9 2016 9 1 45 3 5 High … Black Male Middl… $60000… Prot…
## 10 2016 10 3 71 4 1 Junio… White Male Middl… $60000… None
## # … with 20 more variables: marital <fct>, padeg <fct>, madeg <fct>,
## # partyid <fct>, polviews <fct>, happy <fct>, partners <fct>, grass <fct>,
## # zodiac <fct>, pres12 <labelled>, wtssall <dbl>, income_rc <fct>,
## # agegrp <fct>, ageq <fct>, siblings <fct>, kids <fct>, religion <fct>,
## # bigregion <fct>, partners_rc <fct>, obama <dbl>, and abbreviated variable
## # name ¹income16
# practice 1
religion_count <- gss_sm %>%
group_by(religion) %>%
summarize(N = n()) %>% arrange(-N)
religion_count
## # A tibble: 6 × 2
## religion N
## <fct> <int>
## 1 Protestant 1371
## 2 Catholic 649
## 3 None 619
## 4 Other 159
## 5 Jewish 51
## 6 <NA> 18
# practice 2
rel_by_region <- gss_sm %>%
group_by(bigregion, religion) %>%
summarize(N = n()) %>%
mutate(freq = N/sum(N), pct = round((freq*100),0))
## `summarise()` has grouped output by 'bigregion'. You can override using the
## `.groups` argument.
rel_by_region
## # A tibble: 24 × 5
## # Groups: bigregion [4]
## bigregion religion N freq pct
## <fct> <fct> <int> <dbl> <dbl>
## 1 Northeast Protestant 158 0.324 32
## 2 Northeast Catholic 162 0.332 33
## 3 Northeast Jewish 27 0.0553 6
## 4 Northeast None 112 0.230 23
## 5 Northeast Other 28 0.0574 6
## 6 Northeast <NA> 1 0.00205 0
## 7 Midwest Protestant 325 0.468 47
## 8 Midwest Catholic 172 0.247 25
## 9 Midwest Jewish 3 0.00432 0
## 10 Midwest None 157 0.226 23
## # … with 14 more rows
Need to check whether the grouping is correct:
rel_by_region %>% group_by(bigregion) %>% summarize(total = sum(pct))
## # A tibble: 4 × 2
## bigregion total
## <fct> <dbl>
## 1 Northeast 100
## 2 Midwest 101
## 3 South 100
## 4 West 101
Now plot it:
p <- ggplot(rel_by_region, aes(x = bigregion, y = pct, fill = religion))
# what is the role of this line?
p+geom_col() # why not using geom_bar?
p + geom_col(position = "dodge") +
labs(x = "Region", y = "Percent", fill = "Religion")+
theme(legend.position = "top")
p + geom_col(position = "dodge2")
p <- ggplot(rel_by_region, aes(x = religion, y = pct, fill = religion))
p + geom_col(position = "dodge")
p <- ggplot(rel_by_region, aes(x = religion, y = pct, fill = religion))
p + geom_col(position = "dodge") + guides(fill = "none") + coord_flip()
p <- ggplot(rel_by_region, aes(x = religion, y = pct, fill = religion))
p + geom_col(position = "dodge") + guides(fill = "none") + coord_flip() +
facet_wrap(~ bigregion)
The coord_flip() function switches the x and y axes after the plot is made. It does not remap variables to aesthetics.
Which way is better?
Let’s use different dataset.
setwd("~/Documents/ibs_course/BUS240/data")
load('organdata.rda')
head(organdata, 10)
## # A tibble: 10 × 21
## country year donors pop pop_d…¹ gdp gdp_lag health healt…² pubhe…³
## <chr> <date> <dbl> <int> <dbl> <int> <int> <dbl> <dbl> <dbl>
## 1 Austral… NA NA 17065 0.220 16774 16591 1300 1224 4.8
## 2 Austral… 1991-01-01 12.1 17284 0.223 17171 16774 1379 1300 5.4
## 3 Austral… 1992-01-01 12.4 17495 0.226 17914 17171 1455 1379 5.4
## 4 Austral… 1993-01-01 12.5 17667 0.228 18883 17914 1540 1455 5.4
## 5 Austral… 1994-01-01 10.2 17855 0.231 19849 18883 1626 1540 5.4
## 6 Austral… 1995-01-01 10.2 18072 0.233 21079 19849 1737 1626 5.5
## 7 Austral… 1996-01-01 10.6 18311 0.237 21923 21079 1846 1737 5.6
## 8 Austral… 1997-01-01 10.3 18518 0.239 22961 21923 1948 1846 5.7
## 9 Austral… 1998-01-01 10.5 18711 0.242 24148 22961 2077 1948 5.9
## 10 Austral… 1999-01-01 8.67 18926 0.244 25445 24148 2231 2077 6.1
## # … with 11 more variables: roads <dbl>, cerebvas <int>, assault <int>,
## # external <int>, txp_pop <dbl>, world <chr>, opt <chr>, consent_law <chr>,
## # consent_practice <chr>, consistent <chr>, ccode <chr>, and abbreviated
## # variable names ¹pop_dens, ²health_lag, ³pubhealth
It contains a little more than a decade’s worth of information on the donation of organs for transplants in seventeen OECD countries.
country. Country name.
year. Year.
donors. Organ Donation rate per million population.
pop. Population in thousands.
pop_dens. Population density per square mile.
gdp. Gross Domestic Product in thousands of PPP dollars.
gdp_lag. Lagged Gross Domestic Product in thousands of PPP dollars.
health. Health spending, thousands of PPP dollars per capita.
health_lag Lagged health spending, thousands of PPP dollars per capita.
pubhealth. Public health spending as a percentage of total expenditure.
roads. Road accident fatalities per 100,000 population.
cerebvas. Cerebrovascular deaths per 100,000 population (rounded).
assault. Assault deaths per 100,000 population (rounded).
external. Deaths due to external causes per 100,000 population.
txp_pop. Transplant programs per million population.
world. Welfare state world (Esping Andersen.)
opt. Opt-in policy or Opt-out policy.
consent_law. Consent law, informed or presumed.
consent_practice. Consent practice, informed or presumed.
consistent. Law consistent with practice, yes or no.
ccode. Abbreviated country code.
Let’s interpret the following commands
organdata %>% select(1:6) %>% sample_n(size = 10)
## # A tibble: 10 × 6
## country year donors pop pop_dens gdp
## <chr> <date> <dbl> <int> <dbl> <int>
## 1 Switzerland 1995-01-01 13 7041 17.1 26304
## 2 Australia 1995-01-01 10.2 18072 0.233 21079
## 3 Switzerland 2000-01-01 14 7184 17.4 29837
## 4 Denmark 1997-01-01 11.4 5285 12.3 24676
## 5 United Kingdom NA NA 57238 23.6 16228
## 6 Germany NA NA NA NA NA
## 7 Australia NA NA 17065 0.220 16774
## 8 Italy 1999-01-01 13.7 57646 19.1 23729
## 9 Ireland 1999-01-01 18.7 3756 5.35 25936
## 10 Spain 1994-01-01 25 39166 7.74 15024
Lets’s start by naively graphing some of the data. We can take a look at a scatterplot of donors vs year.
p <- ggplot(organdata,aes(year,donors)) # when you practice, put all the argument names in
p + geom_point()
## Warning: Removed 34 rows containing missing values (geom_point).
The graph looks too rough.
p <- ggplot(data = organdata,
mapping = aes(x = year, y = donors))
p + geom_line(aes(group = country))
## Warning: Removed 34 row(s) containing missing values (geom_path).
Oh faceting. But need to check
table(organdata$country)
##
## Australia Austria Belgium Canada Denmark
## 14 14 14 14 14
## Finland France Germany Ireland Italy
## 14 14 14 14 14
## Netherlands Norway Spain Sweden Switzerland
## 14 14 14 14 14
## United Kingdom United States
## 14 14
p <- ggplot(data = organdata,
mapping = aes(x = year, y = donors))
p + geom_line(aes(group = country)) + facet_wrap(~ country)
## Warning: Removed 34 row(s) containing missing values (geom_path).
Let’s focus on the country-level variation ignoring time variation now. We can use geom_boxplot() to get a picture of variation by year across countries. Just as geom_bar() by default calculates a count of observations by the category you map to x, the stat_boxplot() function that works with geom_boxplot() will calculate a number of statistics that allow the box and whiskers to be drawn. We tell geom_boxplot() the variable we want to categorize by (here, country) and the continuous variable we want summarized (here, donors)
p <- ggplot(data = organdata, mapping = aes(x = country, y = donors))
p + geom_boxplot()
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).
p <- ggplot(data = organdata, mapping = aes(x = country, y = donors))
p + geom_boxplot() + coord_flip()
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).
See the order of faceting plots and box plots.
More clever way is to have the countries listed from high to low average donation rate. Think about reordering the country variable by the mean of donors.
reorder() has two arguments: 1. the categorical variable or factor that we want to reorder 2. by what? need to indicate that variable If you only give reorder() the first two required arguments, then by default it will reorder the categories of your first variable by the mean value of the second
p <- ggplot(data = organdata,
mapping = aes(x = reorder(country, donors),
y = donors))
p + geom_boxplot() + coord_flip()
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).
Well, it did not work out well. See:
vec1 <- c(1,2,3)
vec2 <- c(4,5,NA)
mean(vec1)
## [1] 2
mean(vec2)
## [1] NA
In R, the default mean function will fail with an error if there are missing values in the variable you are trying to take the average of. ou must say that it is OK to remove the missing values when calculating the mean. This is done by supplying the na.rm=TRUE argument to reorder(), which internally passes that argument on to mean().
mean(vec2, na.rm = TRUE)
## [1] 4.5
p <- ggplot(data = organdata,
mapping = aes(x = reorder(country, donors, na.rm = TRUE),
y = donors))
p + geom_boxplot() + coord_flip()
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).
p <- ggplot(data = organdata,
mapping = aes(x = reorder(country, donors, na.rm = TRUE),
y = donors,
fill = world))
p + geom_boxplot() + coord_flip() + theme(legend.position = "bottom") +
labs(x = NULL)
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).
p <- ggplot(data = organdata,
mapping = aes(x = reorder(country, donors, na.rm = TRUE),
y = donors,
color = world))
p + geom_point() + coord_flip() + theme(legend.position = "bottom") +
labs(x = NULL)
## Warning: Removed 34 rows containing missing values (geom_point).
p <- ggplot(data = organdata,
mapping = aes(x = reorder(country, donors, na.rm = TRUE),
y = donors,
color = world))
p + geom_point(alpha = .5) + coord_flip() + theme(legend.position = "bottom") +
labs(x = NULL)
## Warning: Removed 34 rows containing missing values (geom_point).
geom_jitter works much like geom_point(), but randomly nudges each
observation by a small amount.
p <- ggplot(data = organdata,
mapping = aes(x = reorder(country, donors, na.rm = TRUE),
y = donors,
color = world))
p + geom_jitter() + coord_flip() + theme(legend.position = "bottom") +
labs(x = NULL)
## Warning: Removed 34 rows containing missing values (geom_point).
p <- ggplot(data = organdata,
mapping = aes(x = reorder(country, donors, na.rm = TRUE),
y = donors,
color = world))
p + geom_jitter(position = position_jitter(width = .15)) + coord_flip() + theme(legend.position = "bottom") +
labs(x = NULL)
## Warning: Removed 34 rows containing missing values (geom_point).
When we want to summarize a categorical variable that just has one point per category, we should use this approach as well. Think about this:
by_country <- organdata %>% group_by(consent_law, country) %>%
summarize(donors_mean = mean(donors, na.rm = TRUE),
donors_sd = sd(donors, na.rm = TRUE),
gdp_mean = mean(gdp, na.rm = TRUE),
health_mean = mean(health, na.rm = TRUE),
roads_mean = mean(roads, na.rm = TRUE),
cerevbas_mean = mean(cerebvas, na.rm = TRUE))
## `summarise()` has grouped output by 'consent_law'. You can override using the
## `.groups` argument.
by_country
## # A tibble: 17 × 8
## # Groups: consent_law [2]
## consent_law country donors_m…¹ donor…² gdp_m…³ healt…⁴ roads…⁵ cerev…⁶
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Informed Australia 10.6 1.14 22179. 1958. 105. 558.
## 2 Informed Canada 14.0 0.751 23711. 2272. 109. 422.
## 3 Informed Denmark 13.1 1.47 23722. 2054. 102. 641.
## 4 Informed Germany 13.0 0.611 22163. 2349. 113. 707.
## 5 Informed Ireland 19.8 2.48 20824. 1480. 118. 705.
## 6 Informed Netherlands 13.7 1.55 23013. 1993. 76.1 585.
## 7 Informed United Kingdom 13.5 0.775 21359. 1561. 67.9 708.
## 8 Informed United States 20.0 1.33 29212. 3988. 155. 444.
## 9 Presumed Austria 23.5 2.42 23876. 1875. 150. 769.
## 10 Presumed Belgium 21.9 1.94 22500. 1958. 155. 594.
## 11 Presumed Finland 18.4 1.53 21019. 1615. 93.6 771.
## 12 Presumed France 16.8 1.60 22603. 2160. 156. 433.
## 13 Presumed Italy 11.1 4.28 21554. 1757 122. 712.
## 14 Presumed Norway 15.4 1.11 26448. 2217. 70.0 662.
## 15 Presumed Spain 28.1 4.96 16933 1289. 161. 655.
## 16 Presumed Sweden 13.1 1.75 22415. 1951. 72.3 595.
## 17 Presumed Switzerland 14.2 1.71 27233 2776. 96.4 424.
## # … with abbreviated variable names ¹donors_mean, ²donors_sd, ³gdp_mean,
## # ⁴health_mean, ⁵roads_mean, ⁶cerevbas_mean
p <- ggplot(data = by_country,
mapping = aes(x = donors_mean,
y = reorder(country, donors_mean),
color = consent_law))
p + geom_point()
p <- ggplot(data = by_country,
mapping = aes(x = donors_mean,
y = reorder(country, donors_mean),
color = consent_law))
p + geom_point(size = 3) +
labs(x = 'Donor Procurement Rarte', y = "",
color = "Consent Law") +
theme(legend.position = "bottom")
If you don’t like, you might facet them
p <- ggplot(data = by_country,
mapping = aes(x = donors_mean,
y = reorder(country, donors_mean)))
p + geom_point(size = 3) +
facet_wrap( ~ consent_law) +
labs(x = 'Donor Procurement Rarte', y = "")
p <- ggplot(data = by_country,
mapping = aes(x = donors_mean,
y = reorder(country, donors_mean)))
p + geom_point(size = 3) +
facet_wrap( ~ consent_law, scales = "free_y") +
labs(x = 'Donor Procurement Rarte', y = "")
p <- ggplot(data = by_country,
mapping = aes(x = donors_mean,
y = reorder(country, donors_mean)))
p + geom_point(size = 3) +
facet_wrap( ~ consent_law, ncol = 1) +
labs(x = 'Donor Procurement Rarte', y = "")
Optional: if you want to delve deeper
p <- ggplot(data = by_country,
mapping = aes(x = reorder(country, donors_mean),
y = donors_mean))
p + geom_pointrange(mapping = aes(ymin = donors_mean - donors_sd,
ymax = donors_mean + donors_sd)) +
labs(x = '', y = "Donor Procurement Rate") + coord_flip()
Sometimes text itself provides clear information.
p <- ggplot(data = by_country,
mapping = aes(x = roads_mean, y = donors_mean))
p + geom_point() + geom_text(mapping = aes(label = country))
Make it to the left
p <- ggplot(data = by_country,
mapping = aes(x = roads_mean, y = donors_mean))
p + geom_point() + geom_text(mapping = aes(label = country), hjust = 0)
Make it to the right
p <- ggplot(data = by_country,
mapping = aes(x = roads_mean, y = donors_mean))
p + geom_point() + geom_text(mapping = aes(label = country), hjust = 1)
hjust will often fail because the space is added in proportion to the length of the label. The result is that longer labels move further away from their points than you want.
library(ggrepel)
p <- ggplot(data = by_country,
mapping = aes(x = roads_mean, y = donors_mean))
p + geom_point() + geom_text_repel(mapping = aes(label = country))
It has subset function, useful for outliers
p <- ggplot(data = by_country,
mapping = aes(x = gdp_mean, y = health_mean))
p + geom_point()
p <- ggplot(data = by_country,
mapping = aes(x = gdp_mean, y = health_mean))
p + geom_point() +
geom_text_repel(data = subset(by_country, gdp_mean > 25000),
mapping = aes(label = country))
p <- ggplot(data = by_country,
mapping = aes(x = gdp_mean, y = health_mean))
p + geom_point() +
geom_text_repel(data = subset(by_country,
gdp_mean > 25000 | health_mean < 1500 |
country %in% "Belgium"),
mapping = aes(label = country))
p <- ggplot(data = organdata,
mapping = aes(x = roads, y = donors, color = world))
p + geom_point()
## Warning: Removed 34 rows containing missing values (geom_point).
p <- ggplot(data = organdata,
mapping = aes(x = roads, y = donors, color = world))
p + geom_point() +
scale_x_log10()+
scale_y_continuous(breaks = c(5,15,25),
labels = c("Five","Fifteen","Twenty Five"))
## Warning: Removed 34 rows containing missing values (geom_point).
p <- ggplot(data = organdata,
mapping = aes(x = roads, y = donors, color = world))
p + geom_point() +
scale_color_discrete(
labels = c("Corp","Liberal","Social Demo", "No classified"))+
labs(x = "Road Deaths",
y = "Donor Procurement",
color = "Welfare State")
## Warning: Removed 34 rows containing missing values (geom_point).
p <- ggplot(data = organdata,
mapping = aes(x = roads, y = donors, color = world))
p + geom_point() +
labs(x = "Road Deaths",
y = "Donor Procurement")+
guides(color ="none")
## Warning: Removed 34 rows containing missing values (geom_point).
Comment on the following codes 1